cBioPortal Interface and cBioPortal Extraction with R
Purpose
At the end of this session, you will be able to:
- Extract cBioPortal information by the online interface.
- Extract cBioPortal information by R with the cbioportalR R package.
Introduction
cBioPortal is a tool that allows you to explore and visualize
molecular data such as DNA, RNA, proteins, which come from multiple
studies whose processed data have been made available to the scientific
community.
You can see the different studies publicly available here: https://www.cbioportal.org/
cBioPortal is distributed under an open-source license, meaning that the
code is available and anyone can install cBioPortal locally, thereby
having their own cBioPortal instance.
You can see the list of available cBioPortals (local and public) here:
https://installationmap.netlify.app/
Thus, Gustave Roussy has deployed an internal local
cBioPortal, meaning that the information is only available to
Gustave Roussy employees when connected to the GR_Intern
network: https://cbioportal.intra.igr.fr
By default, no study is available. You must make a request for
access to the study of interest to the DAC, using this
form. Then, as soon as you have the green light from the DAC, you
can contact the bioinformatics platform (bigr@gustaveroussy.fr) to obtain your access.
To get the updated list of available studies, send an email to bigr@gustaveroussy.fr.
For this practical, we will use the public cBioPortal https://www.cbioportal.org/, but the principle is the same with another instance of cBioPortal.
Extract cBioPortal information by the online interface
Connection
To use the interface go to the cBioPortal website: https://www.cbioportal.org/
Main interface
On the main page of the site we can see:
- a menu bar at the top,
- a request form at the center,
- an insert on the right where the new features and some example queries
are listed.
The request form lists the studies from Gustave Roussy which are
accessible via the cBioPortal.
Currently 5 studies are available.
We can see the name of each study, followed by its number of
samples. The i logo gives a short description of the
study, the book logo is a link to the publication, and the
pie chart logo is a shortcut to the study summary page.
Data Sets tab
This tab lists the studies from Gustave Roussy which are accessible
via the cBioPortal.
These are the same studies as before.
We can see the name of the studies, the link to the publication, the
total number of samples, then the number of samples for each kind of
data.
Here we have “Mutations”, “Copy Number Alteration” (CNA), and “RNA-seq”
data.
But if you click on the drop-down menu on the right, you can see the
other types of data available.
Study selection
From the request form (click on cBioPortal logo to come back to the
Main interface), we can either:
- select a study of interest by clicking directly on the pie chart logo
of the desired study line.
- select one, several or all, studies by checking them, then click on
the “Explore Selected Studies” button on the form.
From the Data Sets tab, we can click on the name of the study of interest.
Of course if there are too many studies available, you can use the
keyword search form.
Regardless of how you select your study of interest, you will be
redirected to the same page.
This new page contains a general banner at the top, displaying the name
of the study (a small download icon to the right of the name allows you
to download all the clinical and genomic data available for this study),
the number of patients and the number of samples.
The various additional tabs and buttons on this page are detailed below.
Summary tab
The summary tab allows you to summarize the data from the selected
study in a few graphs and tables.
This tab is made up of several elements that can be moved, enlarged
(bottom right corner) or deleted.
The elements are dynamic: when you hover the mouse
over them, additional displays can appear. Likewise, in the upper right corner of
each element, a menu may appear offering several options (deletion of
the element, download, additional information,…).
Tables
Tables have 3 main columns:
- the first column: its name varies depending on the information listed
in the table (“Molecular Profile”, “Genes”, “Categories”, etc.).
- the # column (for the number), corresponds
to the number of samples with the characteristic of the first
column.
- the Freq column (for Frequency), corresponds
to the percentage of samples with the characteristic of the first
column.
In some specific table instances we may have other additional
columns:
- Mutated Genes table:
- # Mut: total number of gene mutations. It may be higher
than the number of samples with this mutation because one sample may
have multiple mutations for this gene.
- Structural Variant Genes table:
- # SV: total number of structural variants of the gene.
Likewise, it may be higher than the number of samples with that
structural variant because a sample may have multiple structural
variants for that gene.
- CNA Genes table:
- Cytoband: genomic region where the CNA is located
(cytoband).
- CNA: the type of CNA (AMP: amplification; HOMDEL:
homozygous deletion).
Note that you can filter the tables using the search bar at the bottom of the element, and sort them in alphabetical or numerical order by clicking on the column names.
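To make the difference between the # column and the # Mut column concrete, here is a minimal sketch on a toy mutation table. The data, the column names and the choice of denominator (the number of samples profiled for mutations) are assumptions made for illustration, not values taken from cBioPortal.

```r
# Toy mutation table: one row per mutation call (hypothetical data)
mutations <- data.frame(
  sampleId = c("S1", "S1", "S2", "S3", "S3", "S4"),
  gene     = c("TP53", "TP53", "TP53", "EGFR", "TP53", "EGFR"),
  stringsAsFactors = FALSE
)
n_profiled <- 5  # assumed number of profiled samples (a fifth sample has no mutation)

# "# Mut": total number of mutation calls per gene
n_mut <- table(mutations$gene)

# "#": number of distinct samples with at least one mutation in the gene
n_samples <- tapply(mutations$sampleId, mutations$gene,
                    function(ids) length(unique(ids)))

gene_table <- data.frame(
  gene      = names(n_samples),
  n_mut     = as.integer(n_mut[names(n_samples)]),
  n_samples = as.integer(n_samples),
  freq_pct  = round(100 * as.integer(n_samples) / n_profiled, 1)
)

# Most frequently mutated gene first
gene_table <- gene_table[order(-gene_table$n_samples), ]
print(gene_table)
```

Here sample S1 carries two TP53 mutations, so TP53 has more mutation calls than mutated samples.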
Graphs
Several types of graphs are available to represent the data: pie charts, histograms, lineplots or scatterplots.
Pie charts allow you to represent discrete data such as gender, ethnicity, number of samples per patient, sample type, etc.
Histograms present continuous data such as age at diagnosis, fraction of altered genome, days to sample collection, etc.
Lineplots present time-to-event data as Kaplan-Meier curves, such as overall survival or disease-free survival.
Scatterplots compare 2 continuous variables, such as the number of mutations as a function of the fraction of altered genome. This type of graph calculates a correlation score between the 2 variables compared (Spearman and Pearson), as well as the associated p-values.
Note that some graphs display the number of NA values. This is the number of samples for which we do not have the requested information.
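The two correlation scores and the NA counts mentioned above can be reproduced in base R. This is a minimal sketch on made-up values: the variable names and numbers are hypothetical, and it is not claimed to be the exact computation performed by cBioPortal.

```r
# Toy per-sample values (hypothetical); NA marks a missing measurement
mutation_count          <- c(12, 30, 45, NA, 80, 150, 22, 60)
fraction_genome_altered <- c(0.05, 0.10, 0.22, 0.30, NA, 0.55, 0.08, 0.35)

# Number of samples that would be reported as NA for each variable
n_na <- c(mutation_count          = sum(is.na(mutation_count)),
          fraction_genome_altered = sum(is.na(fraction_genome_altered)))
print(n_na)

# Keep only the samples where both values are available
complete <- !is.na(mutation_count) & !is.na(fraction_genome_altered)
x <- fraction_genome_altered[complete]
y <- mutation_count[complete]

# Pearson measures linear association, Spearman measures monotonic association
pearson  <- cor.test(x, y, method = "pearson")
spearman <- cor.test(x, y, method = "spearman")

cat("Pearson r    =", round(pearson$estimate, 3),
    " p =", signif(pearson$p.value, 3), "\n")
cat("Spearman rho =", round(spearman$estimate, 3),
    " p =", signif(spearman$p.value, 3), "\n")
```

Samples with a missing value in either variable are dropped before the test, which is why the NA counts are worth checking first.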
More graphs and tables
Not all tables and graphs are displayed by default.
You can click on the Charts button at the top right, browse
through the different menus (“Clinical”, “Genomics”…) to add graphs, and
even try the X to Y menu to make scatterplots, boxplots and
violinplots with variables of your choice.
Sample selection
To make a selection of samples you can click in the tables or in the additional displays given by the graphs when we hover our mouse over them.
If you already have a list of patients or samples of interest you can use the “Custom Selection” button and put your list there.
Note that as soon as you have selected according to a criterion, all the tables/graphs are automatically updated to only present the information related to this selection. Of course you can select according to several criteria.
Clinical Data tab
This tab presents the clinical characteristics of the patients/samples. The available columns depend on the studies.
The Charts button in the Summary tab has become
Columns and allows you to show or hide columns.
There is also a search bar and a download table button.
This tab is up to date with the sample selections made previously.
Other tabs
Other tabs may be available depending on the studies, for example in the study Glioblastoma (TCGA, Cell 2013).
In particular, the Heatmap tab allows you to view
precomputed heatmaps (on the public cBioPortal). The menu on the left
allows you to select a heatmap of interest among those proposed.
We can slightly customize these heatmaps by clicking on them, which redirects us to an editor. In this editor, the left display (“Heat Map Detail”) is a zoom of the complete heatmap (“Heat Map Summary”), which is presented on the right. The region marked on the right heatmap corresponds to what we see on the left heatmap.
You can change certain heatmap parameters using the “Parameters” button.
The “CN Segments” tab allows you to explore the copy number variations along the chromosomes with an IGV type display. The color represents the number of copies and each line corresponds to a patient. There are also settings buttons to display only a chromosome or a region, change colors, save the plot,…
You can zoom by double clicking.
And other tab types, including CT Scan, and probably others that I didn’t come across while creating this course.
The “Plots Beta!” tab allows you to make your own comparison graphs
of 2 variables, in a similar way to the Charts button of
the “Summary” tab, but here we can also compare genomic data. The type
of graph depends on the information to be compared.
Groups comparisons
Once you have selected your samples of interest, you can put them in
a group with the Groups button which will save your
selection. So you can make several groups.
Group comparison allows you to compare their clinical and molecular
characteristics.
For example, return to the “Summary” tab, select all the “Male” and make them a group then do the same with the “Female”.
You see that our groups are displayed with a color (here pink for
“Female” and blue for “Male”).
Then still in the Groups button, select the 2 groups that
we have just made, then click on Compare.
A new page opens with new tabs. The number and type of tabs depend on the availability of data from the selected studies.
Overlap tab
The “Overlap” tab allows you to know if you have patients or samples that are present in both groups at the same time (which can happen if you have made the selection on patients with several samples for example), thanks to a cubic Venn graph.
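The same overlap check can be sketched in R with set operations. The sample IDs below and the convention used to derive a patient ID from a sample ID are hypothetical, chosen to show how two groups can share a patient without sharing a sample.

```r
# Hypothetical sample IDs in two groups (patient P03 has two samples, T1 and T2)
group_a <- c("P01-T1", "P02-T1", "P03-T1")
group_b <- c("P03-T2", "P04-T1")

# Overlap at the sample level
sample_overlap <- intersect(group_a, group_b)

# Overlap at the patient level: strip the assumed "-T<number>" sample suffix
to_patient <- function(sample_ids) unique(sub("-T[0-9]+$", "", sample_ids))
patient_overlap <- intersect(to_patient(group_a), to_patient(group_b))

cat("Samples in both groups: ", length(sample_overlap), "\n")
cat("Patients in both groups:", paste(patient_overlap, collapse = ", "), "\n")

# Samples that belong to only one of the two groups
only_a <- setdiff(group_a, group_b)
only_b <- setdiff(group_b, group_a)
```

In this toy case no sample is shared, but patient P03 appears in both groups through two different samples.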
Overall survival and Clinical tabs
The “Survival” tab compares the overall survival of the 2 groups with a Kaplan-Meier graph.
The clinical tab compares the clinical data between the 2 groups with
appropriate statistical tests and presents graphs. Results are sorted by
significance and significant results are in bold.
As usual, you can select the columns to display, download the results
table and graphs, search by keyword, and change the graph display.
Note: whether for the Survival or Clinical tab, there is a banner indicating that it is necessary to “Interpret all results with caution, as they can be confounded by many different variables that are not controlled for in these analyses. Consider consulting a statistician.”. So the statistical tests carried out here can give us a first insight but it is better to discuss with a specialist before making any conclusions.
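For readers who want to redo this kind of comparison outside the interface, here is a minimal sketch of a Kaplan-Meier estimate with a log-rank test, using the survival package. The follow-up data are invented, and the use of a log-rank test is an assumption about a reasonable default, not a description of the exact cBioPortal computation.

```r
library(survival)  # recommended package shipped with standard R installations

# Hypothetical follow-up data: time in months, status 1 = death, 0 = censored
surv_data <- data.frame(
  months = c(5, 8, 12, 14, 20, 24, 6, 15, 22, 30, 36, 40),
  status = c(1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0),
  group  = rep(c("Male", "Female"), each = 6)
)

# Kaplan-Meier estimate of the survival curve for each group
km_fit <- survfit(Surv(months, status) ~ group, data = surv_data)
print(km_fit)

# Log-rank test comparing the two curves
logrank <- survdiff(Surv(months, status) ~ group, data = surv_data)
p_value <- 1 - pchisq(logrank$chisq, df = length(logrank$n) - 1)
cat("Log-rank p-value:", signif(p_value, 3), "\n")
```

Calling `plot(km_fit)` draws the two curves. The caution quoted above applies here as well: this test does not adjust for any confounding variable.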
Genomic Alteration tab
The “Genomic Alteration” tab allows you to compare mutations between the 2 groups, in particular the frequency of alteration overall or for certain genes. As usual, you have additional displays if you hover your mouse over the graphs, you can download the graphs and change which genes are displayed.
At the bottom you have a comparison table for each alteration. If you hover over the column names, explanations are displayed, and you have the usual selection and save options.
Mutations Beta! tab
The “Mutations Beta!” tab allows you to compare the mutations between the 2 groups by representing them along the protein domains. Mutations are represented above or below the protein axis depending on the group of patients to which they belong.
By default it offers you the proteins with the highest mutation
frequency of their gene, but you can choose your protein/gene of
interest. You can also display additional annotations by clicking on the
“Add annotation Tracks” button (they are displayed at the bottom of the
graph). You can view the legend by clicking the “Legend” button at the
top right.
Do not hesitate to hover your mouse over the graphic elements, the
displays are dynamic.
Similar to the “Genomic Alteration” tab, at the bottom you have a table
which summarizes the protein changes and in which group it is enriched,
with a score and significance. Note the presence of the “Annotation”
column which gives you additional information from databases (OncoKB,
CIViC,…).
mRNA, Protein, DNA Methylation tabs and other data
The “mRNA” tab allows you to carry out a differential expression
analysis between our 2 groups. The graph displayed by default is a
volcanoplot. There is also a table at the bottom with the results of the
differential analysis. It is sorted by significance (q-Value) and
significant genes are in bold.
If we click on a gene in this table, a new graph appears with the
expression levels in boxplot format.
Note that if you hover over the “p-Value” column name, it indicates that the test used is a Student’s t-test. This is a very good example of why it is necessary to seek the advice of a statistician before interpreting the results, because a t-test requires having many samples to compare and other statistical tests are more suitable.
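To see how the choice of test can change the picture, the sketch below runs a t-test and a rank-based Wilcoxon test on the same simulated expression matrix, then adjusts the p-values. The data are simulated, and the Benjamini-Hochberg adjustment used for the q-values is an assumption. The Wilcoxon test is shown only as one possible alternative, not as a recommendation.

```r
set.seed(42)  # reproducible toy data

# Hypothetical expression matrix: 5 genes x 12 samples, 6 samples per group
genes <- paste0("GENE", 1:5)
group <- rep(c("A", "B"), each = 6)
expr  <- matrix(rnorm(5 * 12, mean = 8, sd = 1), nrow = 5,
                dimnames = list(genes, paste0("S", 1:12)))

# Simulate one gene that really differs between the two groups
expr["GENE1", group == "B"] <- expr["GENE1", group == "B"] + 3

# Per-gene p-values with a t-test and with a Wilcoxon rank-sum test
p_ttest  <- apply(expr, 1, function(v) t.test(v[group == "A"], v[group == "B"])$p.value)
p_wilcox <- apply(expr, 1, function(v) wilcox.test(v[group == "A"], v[group == "B"])$p.value)

results <- data.frame(
  gene     = genes,
  p_ttest  = p_ttest,
  p_wilcox = p_wilcox,
  q_ttest  = p.adjust(p_ttest, method = "BH"),   # multiple-testing correction
  q_wilcox = p.adjust(p_wilcox, method = "BH")
)

# Most significant genes first, as in the results table
print(results[order(results$p_ttest), ])
```

With only 6 samples per group the two tests can rank genes differently, which is a good reason to discuss the method with a statistician.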
The “Protein” and “DNA Methylation” tabs are identical to the “mRNA” tab but for proteins or methylation.
Other data can be available like Arm-level CNA or Genetic Ancestry, depending on the study.
Search by genes
Instead of selecting the samples/patients, then doing the analyses/tables/graphs, then searching for your genes of interest in each analysis/table/graph, you can indicate your genes from the start and get results already filtered.
To do this, we return to the very first page of the cBioPortal (by clicking on the logo at the top left of the page), we select the study (or studies) of interest, then we click on “Query By Gene”.
The rest of the form unfolds and you can choose the molecular data on which you want to work; then the patients/samples of interest, for example we keep all the samples or only those which have CNA data, or we can put our own list by choosing “User-defined Case List” and putting our list of IDs. Then, we write our genes of interest (separated by a comma if you want to put several), or we can also choose pre-defined lists of genes by clicking on the drop-down menu.
Then we click on “Submit Query”.
If you have a lot of samples, it may take a few minutes before the
following page appears with multiple tabs.
The number and type of tabs depend on the availability of data from the
selected studies.
Oncoprint tab
The first result obtained is an oncoprint.
This representation is widely used in publications analyzing patient
cohorts because it allows a quick overview of the distribution of
genetic alterations in the cohort.
In this graph, each vertical line corresponds to a patient. Then,
horizontally, we have their clinical data (different information
depending on the study and modifiable with the “Tracks” drop-down menu),
and the alteration information according to our genes of interest, with
the percentage of patients with an alteration in these genes and the
type of alteration.
At the top of this table we have a banner with buttons for sorting,
filtering and zooming in/out.
Cancer Type Summary tab
This tab presents the same results but in the form of a cumulative histogram by type of cancer and by gene requested. You can choose the gene and cancer type level to plot with the options at the top of the graph.
Mutual Exclusivity tab
In this tab you can see if the genes in your list are linked. In other words, if one gene is mutated, is the other gene also often mutated (“Co-occurrence”) or on the contrary it never is (“Mutual exclusivity”).
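The idea behind this analysis can be sketched with a 2x2 contingency table. The alteration statuses below are invented, and the combination of a log2 odds ratio with a two-sided Fisher's exact test is an assumption made for illustration. The test actually run by cBioPortal may differ, for example in being one-sided.

```r
# Hypothetical alteration status of two genes across the same 20 samples
gene_a <- c(rep(TRUE, 8), rep(FALSE, 12))
gene_b <- c(rep(FALSE, 7), rep(TRUE, 3), rep(FALSE, 10))

# The four cells of the 2x2 table
both    <- sum(gene_a & gene_b)
a_only  <- sum(gene_a & !gene_b)
b_only  <- sum(!gene_a & gene_b)
neither <- sum(!gene_a & !gene_b)

# Log2 odds ratio: > 0 suggests co-occurrence, < 0 suggests mutual exclusivity
# (it is infinite or undefined when one of the cells is 0)
log2_or  <- log2((both * neither) / (a_only * b_only))
tendency <- if (log2_or > 0) "Co-occurrence" else "Mutual exclusivity"

# Fisher's exact test on the 2x2 table (two-sided here)
fisher <- fisher.test(matrix(c(both, a_only, b_only, neither), nrow = 2))

cat("log2 odds ratio:", round(log2_or, 3), "\n")
cat("Tendency:", tendency, "\n")
cat("Fisher p-value:", signif(fisher$p.value, 3), "\n")
```

A tendency is only a direction. The p-value tells you whether the data support it, and with few altered samples it usually does not.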
Plots tab
This tab allows you to interactively generate graphs combining different types of data.
For example, we can look at the expression of TP53 according to the type of Copy Number alteration of TP53, by coloring the TP53 alteration type.
Or for example, we could look at the variation in the protein level of our gene depending on its gene expression level (RNA).
Mutations tab
This tab corresponds to the tab named “Mutations Beta!” obtained when comparing several groups (except that here we have no comparison). We have the graph of the distribution of mutations along the proteins of our genes of interest, with the corresponding table.
Co-Expression tab
The “Co-Expression” tab allows us to know whether other genes are
co-expressed with our genes of interest, in other words, genes whose
expression evolves in a similar way across patients.
The results are in tabular form with all genes tested, the most
significant at the top. We can also plot a scatterplot of the expression
of the 2 genes and a correlation score is calculated. If the mutation
status is available the points are colored according to this status.
Example with the Glioblastoma Multiforme study (TCGA GDC, 2025) on the public cBioPortal:
Comparison/Survival tab
This tab itself has several tabs, which are the same as when we
compared the 2 groups of samples previously (Overlap, Survival,
Clinical, Genomic Alterations, mRNA, Protein, DNA Methylation,…), but
here to compare the samples with at least one alteration of our genes of
interest against the samples without alteration.
You can also change the groups to compare:
- samples with at least one alteration among all our genes of
interest,
- samples without any alteration among our genes of interest,
- samples with at least one alteration in a chosen gene.
CN Segments tab
The graphs are the same as for the “CN Segments” tab seen previously, but here we have buttons for each of our genes to go directly to their genomic regions more easily.
Pathways tab
This tab allows us to display the signaling pathways of our genes of
interest.
There are 2 pathway databases available and each has its own viewer:
- PathwayMapper shows pathways from over fifty cancer related pathways
and provides a collaborative web-based editor for creating new
ones.
It displays network signaling pathways as well as the frequency of alteration of genes present in these pathways (if the information is available).
On the right is the table of results with all the signaling pathways identified (sometimes you can have several pathways for a single gene).
>Please note if the “Show TCGA PanCancer Atlas pathways only” option is checked then only the pathways presented in the TCGA PanCancer Atlas pathways will be shown (and not all the pathways in PathwayMapper).
>Note that you can move the pathway bricks to arrange them wherever you want.
- NDEx shows 1,352 pathways by aggregating several other databases:
NCI-PID, Signor, WikiPathways, CPTAC, CCMI and NeST. Here we have the
list of signaling pathways on the left and their display on the
right.
The display and legend differ between pathways because they depend on the database used. But our genes of interest are always boxed in pink.
You can click on the name of the pathway (in the title displayed in blue) to get information on it, as well as on the genes (nodes) and the links between genes (edges). But the graphs are also dynamic so we can get this information by clicking on the genes or on the links between the genes directly.
Download tab
On the public cBioPortal only, this tab allows you to download all the clinical and genomic data from the samples selected according to our genes of interest.
Make graphs on your own data?
Another interesting feature of cBioPortal is that you can visualize your own data. You can make:
- an oncoprint,
- a MutationMapper plot.
To do this, click on the “Visualize Your Data” tab (at the very top of the page), then on “OncoPrinter” or “MutationMapper”.
Please note, all data must be anonymized before upload.
Oncoprint
This allows you to make the same oncoprint as seen previously. You
can copy and paste your tables directly into the corresponding areas or
load them via a file. To get an idea of the format you can click on
“Load example data” and/or for an explanation of the format on “View
data format”. You can put genomic, clinical and/or heatmap
information. Once the information has been entered, you can optionally
choose an order of appearance of the genes or samples on the graph; and
click on “Submit”.
The oncoprint appears (with genomic information, then
clinical, then heatmap) as well as a Mutual Exclusivity
analysis.
Mutationmapper
This allows you to create a graph representing mutations along
protein domains.
You choose your reference genome (the one used for your analysis), then
enter your analysis data.
To see the expected format, you can click on one of the examples given
in “Load example data” and/or for an explanation of the format, click on
“Data format”.
Then click on “Visualize”.
Save your session
To store your virtual studies and groups you can sign in with your
Google or Microsoft account on public cBioPortal (similarly with your
Gustave Roussy account on the Gustave Roussy cBioPortal). This will
allow you to access your studies and groups from any computer, and
cBioPortal will also remember your study view charts preferences for
each study (i.e. order of the charts, type of charts and
visibility).
Login is optional on the public cBioPortal and not required to access
any of the other features of cBioPortal.
Also, you can share your patient selection by creating a web link to your selection. Click on the “Save/Share Virtual Study” button, give a name to your selection, then click on “Save” or “Share”.
You can save the link to your virtual study to share it with your
colleagues.
Also, if you come back to the very first page of the cBioPortal (by clicking on the logo at the top left of the page), you can see your new virtual study.
Extract cBioPortal information by R
But what about the reproducibility of your research on this website? Do
you remember each button you clicked, especially for study selection and
sample/patient selection?
The easiest way is to make an R script.
There are 2 R packages that allow you to retrieve data from
cBioPortal: cbioportalR
and cBioPortalData.
Unfortunately, neither of these 2 packages allows you to recover all the
data in an easily analyzable format, nor to create graphs (the graphs
will have to be created by yourself).
cbioportalR allows you to retrieve data in an easily filterable
data.frame format but only clinical data, mutations, copy numbers (with
segments) and structural variants. Basically, it does not recover gene
expression, methylation, protein levels, etc. In addition, it only
provides dataframes that are not linked together, so if we filter the
mutation table to only keep certain patients (for example), we must
filter the clinical data accordingly on the same set of patients.
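Keeping unlinked tables in sync can be sketched with dplyr on toy data frames. The tables and their column names (`patientId`, `hugoGeneSymbol`) are assumptions for illustration and may not match the columns returned for your study.

```r
library(dplyr)

# Hypothetical, unlinked tables mimicking separately retrieved data frames
clinical <- data.frame(
  patientId = c("P1", "P2", "P3", "P4"),
  sex       = c("Male", "Female", "Female", "Male"),
  stringsAsFactors = FALSE
)
mutations <- data.frame(
  patientId      = c("P1", "P1", "P2", "P3", "P4"),
  hugoGeneSymbol = c("TP53", "EGFR", "TP53", "PTEN", "EGFR"),
  stringsAsFactors = FALSE
)

# Step 1: filter the mutation table (keep TP53 mutations only)
tp53_mutations <- mutations %>%
  filter(hugoGeneSymbol == "TP53")

# Step 2: propagate the same patient selection to the clinical table
tp53_clinical <- clinical %>%
  semi_join(tp53_mutations, by = "patientId")

print(tp53_clinical)
```

`semi_join()` keeps the rows of the clinical table that have a match in the filtered mutation table, without adding any column or duplicating patients.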
cBioPortalData
allows you to retrieve data in a complex format called
MultiAssayExperiment which is an assembly of several other objects of
type SummarizedExperiment and RaggedExperiment. This assembly makes it
possible to link clinical data to experimental data, and to filter
everything in a single block (so we can filter clinical and experimental
data at the same time). In my experience, only structural variant data
is not retrieved with this package, however, it heavily filters the
available information. For example, for mutations, it only loads the
position of the mutation (for each mutated gene for each sample) but we
no longer have the information concerning the mutated nucleotide
(A?T?C?G?), the number of mutated and normal reads (so it is impossible
to compute the VAF), nor any annotation (missense? impact on the protein?
already known from the annotation databases (COSMIC…)?).
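As a reminder of why the read counts matter, the VAF (variant allele frequency) is the number of reads supporting the mutated allele divided by the total number of reads at that position. The sketch below uses invented counts, and the column names `tumorAltCount` and `tumorRefCount` are assumptions about how such counts might be labelled.

```r
# Hypothetical read counts per mutation call (column names are assumptions)
mutations <- data.frame(
  sampleId      = c("S1", "S1", "S2", "S3"),
  gene          = c("TP53", "EGFR", "TP53", "PTEN"),
  tumorAltCount = c(45, 10, 0, NA),     # reads supporting the mutated allele
  tumorRefCount = c(55, 190, 80, 120),  # reads supporting the reference allele
  stringsAsFactors = FALSE
)

# VAF = alt / (alt + ref); left as NA when a count is missing or the depth is 0
depth <- mutations$tumorAltCount + mutations$tumorRefCount
mutations$vaf <- ifelse(depth > 0, mutations$tumorAltCount / depth, NA_real_)

print(mutations)
```

Without both counts for each mutation, this ratio cannot be recomputed afterwards.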
Welcome to the real deep world of bioinformatics. Everything is not always obvious, but we always end up finding a solution with a little agility and imagination.
The shortcomings of these two packages are probably due to the file formats, which can differ between studies. For example, for RNA there can be several versions of the expression tables: raw counts, log2-transformed, z-scores, and either by gene or by transcript.
Limitations are also attributable to the API (application programming interface) created by the cBioPortal team. An API is an additional functionality of a website that allows information to be retrieved from the site programmatically, for example from a script or the command line. So these packages query the website via the API, but if the API is limited, the packages are limited too.
For this course, we suggest using the cbioportalR
package as a basis, then using some in-house functions to recover the
missing information. The results will be in a dataframe format, easy to
filter with functions dedicated to dataframes.
I strongly advise you to have understood the courses on Tables Manipulations and the one on Making graphs with ggplot2. We will consider their content as acquired for the rest of this course.
Setup environment and token
We load cbioportalR R package to get data, and
dplyr/tidyr R packages to manipulate data.
# Installation
install.packages("cbioportalR")
# Loading library
library(cbioportalR)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
Before accessing data you will need to connect to a cBioPortal database and set your base URL for the R session.
For the Gustave Roussy cBioPortal
If you want to use the Gustave Roussy cBioPortal, you need to connect to the GR_Intern network and establish the connection between your R session and the internal cBioPortal. For this last point, you need to add a token to your ~/.Renviron file to authorize access:
- Go to the cBioPortal interface (https://cbioportal.intra.igr.fr), then retrieve your token (top right):
- Modify your R environment by editing the ~/.Renviron file to add a
new variable like:
CBIOPORTAL_TOKEN = 'YOUR_TOKEN'
#To open the file ~/.Renviron
usethis::edit_r_environ()
- Then connect to “cbioportal.intra.igr.fr” with the
set_cbioportal_db() function:
base_url <- "cbioportal.intra.igr.fr"
set_cbioportal_db(base_url)
The token is valid for 30 days. After this time it will be necessary to regenerate it.
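If the connection is refused, a quick first check is whether R actually sees the token. The sketch below only reads the `CBIOPORTAL_TOKEN` environment variable described above. It does not contact the server, so it cannot tell whether the token has expired.

```r
# Read the token that R loaded from ~/.Renviron at startup
token <- Sys.getenv("CBIOPORTAL_TOKEN")

if (nzchar(token)) {
  # Never print the token itself: report only that it is present
  message("CBIOPORTAL_TOKEN is set (", nchar(token), " characters).")
} else {
  message("CBIOPORTAL_TOKEN is empty: edit ~/.Renviron, then restart R.")
}
```

The ~/.Renviron file is read when R starts, so a token added during a session is only picked up after restarting R.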
For the public cBioPortal
If you want to use the public cBioPortal database instance (https://www.cbioportal.org), you do not need a token to
access this public website: just connect to the “https://www.cbioportal.org” base URL with the
set_cbioportal_db() function:
base_url <- "https://www.cbioportal.org"
set_cbioportal_db(base_url)
## v You are successfully connected!
## v base_url for this R session is now set to "www.cbioportal.org/api"
In the rest of the course we will use the public cBioPortal database instance, but everything works similarly for the Gustave Roussy cBioPortal.
Identifying available studies
Now that we are successfully connected, we may want to view available studies for our chosen database to find the correct study_id corresponding to the data we want to pull. You can view all studies available in your database with the following:
studies <- available_studies()
head(studies)
## # A tibble: 6 x 14
## studyId name description publicStudy pmid citation groups status importDate
## <chr> <chr> <chr> <lgl> <chr> <chr> <chr> <int> <chr>
## 1 cesc_tc~ Cerv~ "Cervical ~ TRUE 2962~ TCGA, C~ "PUBL~ 0 2024-12-2~
## 2 sarc_tc~ Sarc~ "Sarcoma T~ TRUE 2962~ TCGA, C~ "PUBL~ 0 2024-12-2~
## 3 crc_ori~ Colo~ "Combined ~ TRUE 3938~ Wala, J~ "" 0 2025-06-3~
## 4 crc_ori~ Colo~ "Combined ~ TRUE 3938~ Wala, J~ "" 0 2025-06-3~
## 5 crc_ori~ Colo~ "Combined ~ TRUE 3938~ Wala, J~ "" 0 2025-06-3~
## 6 lusc_tc~ Lung~ "Lung Squa~ TRUE 2962~ TCGA, C~ "PUBL~ 0 2024-12-2~
## # i 5 more variables: allSampleCount <int>, readPermission <lgl>,
## # resourceCounts <list>, cancerTypeId <chr>, referenceGenome <chr>
We get several pieces of information such as the name of the studies, their description, their publication, the date of import, the number of samples,…
The number of available studies:
nrow(studies)
## [1] 50
Hmm, only 50 studies? That's not a lot! Where are the ~500 studies
presented on the website?
This is a limitation of the API: it returns only the last 50 studies by
default.
Unfortunately the cbioportalR package does not offer a solution, so we
will have to code that by hand. The code is given to you ready-made:
there is no need to understand it in detail, as that is not the subject of the course
here.
#load packages
library(httr) #to send HTTP requests to the API
library(jsonlite) #to convert JSON responses to R objects
#cBioPortal API address listing the studies
studies_url <- paste0(base_url, "/api/studies")
#send a GET request to the API
res <- GET(
  studies_url,
  query = list(
    pageSize = 1000, # 1000 studies requested (there are ~500 studies, so that's sufficient for this database)
    pageNumber = 0 # the first page
  ),
  accept_json() # request JSON format
)
#stop with an explicit error if the request failed
stop_for_status(res)
#format into data.frame
all_studies <- fromJSON(content(res, "text", encoding = "UTF-8"))
#print the first lines of the data.frame
head(all_studies)
## name
## 1 Metastatic Solid Cancers (UMich, Nature 2017)
## 2 Stomach Adenocarcinoma (TCGA, Firehose Legacy)
## 3 Colorectal Cancer (CAS Shanghai, Cancer Cell 2020)
## 4 MSK-IMPACT Heme Tumors (MSK, 2022)
## 5 Pancreatic Adenocarcinoma (ICGC, Nature 2012)
## 6 Esophagogastric Cancer (MSK, Clin Cancer Res 2022)
## description
## 1 Whole-exome and -transcriptome sequencing of 500 adult patients with metastatic solid tumor/primary normal pairs of diverse lineage and biopsy site.
## 2 TCGA Stomach Adenocarcinoma. Source data from <A HREF="http://gdac.broadinstitute.org/runs/stddata__2016_01_28/data/STAD/20160128/">GDAC Firehose</A>. Previously known as TCGA Provisional.
## 3 Whole-exome sequencing of 146 colorectal tumor/normal pairs from a chinese cohort, covering 70 metastatic and 76 non-metastatic colorectal cancer patients.
## 4 Targeted sequencing of 2383 myeloid and lymphoid neoplasms and their matched normals via MSK-IMPACT Heme panel.
## 5 Whole-exome sequencing of 99 pancreatic samples and their matched normals.
## 6 Targeted sequencing of 237 esophagogastric tumor/normal pairs via MSK-IMPACT platform.
## publicStudy pmid citation groups status
## 1 TRUE 28783718 Robinson et al. Nature 2017 0
## 2 TRUE <NA> <NA> PUBLIC 0
## 3 TRUE 32888432 Li et al. Cancer Cell 2020 PUBLIC 0
## 4 TRUE <NA> <NA> 0
## 5 TRUE 23103869 Biankin et al. Nature 2012 0
## 6 TRUE 35377946 Smita et al. Clin Cancer Res 2022 0
## importDate allSampleCount readPermission resourceCounts
## 1 2024-12-09 10:46:46 1 TRUE NULL
## 2 2025-06-17 12:13:18 1 TRUE NULL
## 3 2024-12-20 11:02:57 1 TRUE NULL
## 4 2024-12-16 10:56:34 1 TRUE NULL
## 5 2025-06-11 22:02:02 1 TRUE NULL
## 6 2024-12-04 18:47:18 1 TRUE NULL
## studyId cancerTypeId referenceGenome
## 1 metastatic_solid_tumors_mich_2017 mixed hg19
## 2 stad_tcga stad hg19
## 3 coadread_cass_2020 coadread hg19
## 4 heme_msk_impact_2022 mixed hg19
## 5 paad_icgc paad hg19
## 6 egc_msk_tp53_ccr_2022 egc hg19
#print the number of rows of the data.frame
nrow(all_studies)
## [1] 519
Great! Now there are over 500 studies available!
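Requesting one very large page works here, but on a bigger instance you may prefer to walk through the pages one by one. The sketch below shows the loop on its own, with a fake offline data source standing in for the API. `fetch_all_pages()`, `fetch_page` and `fake_fetch()` are hypothetical helpers written for this example, not functions from cbioportalR or httr.

```r
# Generic paging loop. `fetch_page` is any function(page_number, page_size)
# returning a data.frame for one page (zero rows when nothing is left).
fetch_all_pages <- function(fetch_page, page_size = 100, max_pages = 1000) {
  pages <- list()
  for (page_number in seq_len(max_pages) - 1) {  # pages are counted from 0
    page <- fetch_page(page_number, page_size)
    if (nrow(page) == 0) break
    pages[[length(pages) + 1]] <- page
    if (nrow(page) < page_size) break  # a short page is the last one
  }
  do.call(rbind, pages)
}

# Offline stand-in for the API, serving 230 fake studies
fake_studies <- data.frame(studyId = paste0("study_", 1:230),
                           stringsAsFactors = FALSE)
fake_fetch <- function(page_number, page_size) {
  start <- page_number * page_size + 1
  if (start > nrow(fake_studies)) return(fake_studies[0, , drop = FALSE])
  end <- min(start + page_size - 1, nrow(fake_studies))
  fake_studies[start:end, , drop = FALSE]
}

all_fake <- fetch_all_pages(fake_fetch, page_size = 100)
print(nrow(all_fake))
```

With the real API, `fetch_page` would wrap the GET request shown above, passing `page_number` and `page_size` as the `pageNumber` and `pageSize` query parameters.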
We can plot the top 20 of cancerTypeId with the largest number of studies:
library(ggplot2)
#get the top 20 cancer types
top_cancers <- all_studies %>%
  count(cancerTypeId) %>%
  top_n(20, n)
#plot
ggplot(top_cancers, aes(x = reorder(cancerTypeId, n), y = n)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() + # flip axes for readability
  labs(
    title = "Number of Studies per Cancer Type",
    x = "Cancer Type",
    y = "Number of Studies"
  ) +
  theme_classic()
Choose your study of interest
If we want to search for all studies related to glioblastoma (i.e. studies that have the term “glioblastoma” in their name):
as.data.frame(subset(all_studies, grepl("glioblastoma", all_studies$name, ignore.case = TRUE)))
## name
## 113 Glioblastoma (CPTAC, Cell 2021)
## 185 Glioblastoma Multiforme (TCGA, PanCancer Atlas)
## 265 Glioblastoma Multiforme (TCGA GDC, 2025)
## 266 Glioblastoma (TCGA, Cell 2013)
## 274 Glioblastoma Multiforme (TCGA, Firehose Legacy)
## 420 Glioblastoma (Columbia, Nat Med. 2019)
## 500 Glioblastoma (TCGA, Nature 2008)
## description
## 113 Proteogenomic and metabolomic characterization of human glioblastoma. Whole genome or whole exome sequencing of 99 samples. Generated by CPTAC.
## 185 Glioblastoma Multiforme TCGA PanCancer data. The original data is <a href="https://gdc.cancer.gov/about-data/publications/pancanatlas">here</a>. The publications are <a href="https://www.cell.com/pb-assets/consortium/pancanceratlas/pancani3/index.html">here</a>.
## 265 TCGA Glioblastoma Multiforme. Source data from <A HREF="https://gdc.cancer.gov">NCI GDC</A> and generated in Aug 2025 using <A HREF="https://cda.readthedocs.io/en/latest/">Cancer Data Aggregator</A>.
## 266 Whole-exome and/or whole-genome sequencing of 291 of the 577 glioblastoma tumor/normal pairs. The Cancer Genome Atlas (TCGA) Glioblastoma Project.
## 274 TCGA Glioblastoma Multiforme. Source data from <A HREF="http://gdac.broadinstitute.org/runs/stddata__2016_01_28/data/GBM/20160128/">GDAC Firehose</A>. Previously known as TCGA Provisional.
## 420 Whole-exome sequencing of 32 out of 42 glioblastomas patients with matched normals.
## 500 Targeted sequencing in 91 of the 206 primary glioblastoma tumors (143 with matched normals) from the Cancer Genome Atlas (TCGA) Glioblastoma Project.
## publicStudy
## 113 TRUE
## 185 TRUE
## 265 TRUE
## 266 TRUE
## 274 TRUE
## 420 TRUE
## 500 TRUE
## pmid
## 113 33577785
## 185 29625048,29596782,29622463,29617662,29625055,29625050,29617662,30643250,32214244,29625049,29850653,36334560
## 265 <NA>
## 266 24120142
## 274 <NA>
## 420 30742119
## 500 18772890
## citation groups status importDate
## 113 Wang et al. Cell 2021 0 2025-10-21 16:28:33
## 185 TCGA, Cell 2018 PUBLIC;PANCAN 0 2025-10-21 16:06:43
## 265 <NA> PUBLIC 0 2025-10-21 16:33:34
## 266 TCGA, Cell 2013 0 2025-10-21 15:41:40
## 274 <NA> PUBLIC 0 2025-10-21 15:32:47
## 420 Zhao et al. Nat Med 2019 0 2025-10-21 16:14:34
## 500 TCGA, Nature 2008 PUBLIC 0 2025-10-21 15:38:53
## allSampleCount readPermission
## 113 1 TRUE
## 185 1 TRUE
## 265 1 TRUE
## 266 1 TRUE
## 274 1 TRUE
## 420 1 TRUE
## 500 1 TRUE
## resourceCounts
## 113 NULL
## 185 IDC_OHIF_V2, CT Scan, CT Scan, PATIENT, 1, TRUE, 592, 585, gbm_tcga_pan_can_atlas_2018
## 265 NULL
## 266 NULL
## 274 NULL
## 420 NULL
## 500 NULL
## studyId cancerTypeId referenceGenome
## 113 gbm_cptac_2021 difg hg19
## 185 gbm_tcga_pan_can_atlas_2018 difg hg19
## 265 gbm_tcga_gdc difg hg38
## 266 gbm_tcga_pub2013 difg hg19
## 274 gbm_tcga difg hg19
## 420 gbm_columbia_2019 difg hg19
## 500 gbm_tcga_pub difg hg19
Note: grepl searches each element of a character vector for a pattern and returns a logical vector (TRUE/FALSE). This vector is passed to the subset function, which keeps only the rows of the table where it is TRUE. We set ignore.case = TRUE because pattern matching is case-sensitive by default (“Glioblastoma” is different from “glioblastoma”).
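To see the effect of ignore.case in isolation, here is a minimal sketch on a made-up table. The table toy_studies and all of its names and identifiers are hypothetical and are not taken from cBioPortal.

```r
# Toy table standing in for all_studies (names and ids are invented)
toy_studies <- data.frame(
  name    = c("Glioblastoma (toy study A)",
              "Pancreatic cancer (toy study B)",
              "Recurrent glioblastoma (toy study C)"),
  studyId = c("toy_a", "toy_b", "toy_c"),
  stringsAsFactors = FALSE
)

# Case-sensitive search: misses the capitalised "Glioblastoma"
hits_sensitive <- grepl("glioblastoma", toy_studies$name)

# Case-insensitive search: matches both spellings
hits_insensitive <- grepl("glioblastoma", toy_studies$name, ignore.case = TRUE)

# subset() keeps the rows where the logical vector is TRUE
toy_hits <- subset(toy_studies, hits_insensitive)
print(toy_hits$studyId)
```

Without ignore.case = TRUE, the first toy study would be silently dropped from the result, which is why the search above sets it.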
For the example, we choose the study named “Glioblastoma Multiforme (TCGA, PanCancer Atlas)”, whose identifier is “gbm_tcga_pan_can_atlas_2018”:
study_id <- "gbm_tcga_pan_can_atlas_2018"
To get more information on this study, we can do the following:
study_info <- get_study_info(study_id) %>% t()
study_info
## [,1]
## name "Glioblastoma Multiforme (TCGA, PanCancer Atlas)"
## description "Glioblastoma Multiforme TCGA PanCancer data. The original data is <a href=\"https://gdc.cancer.gov/about-data/publications/pancanatlas\">here</a>. The publications are <a href=\"https://www.cell.com/pb-assets/consortium/pancanceratlas/pancani3/index.html\">here</a>."
## publicStudy "TRUE"
## pmid "29625048,29596782,29622463,29617662,29625055,29625050,29617662,30643250,32214244,29625049,29850653,36334560"
## citation "TCGA, Cell 2018"
## groups "PUBLIC;PANCAN"
## status "0"
## importDate "2025-10-21 16:06:43"
## allSampleCount "1"
## sequencedSampleCount "397"
## cnaSampleCount "575"
## mrnaRnaSeqSampleCount "0"
## mrnaRnaSeqV2SampleCount "160"
## mrnaMicroarraySampleCount "0"
## miRnaSampleCount "0"
## methylationHm27SampleCount "0"
## rppaSampleCount "231"
## massSpectrometrySampleCount "0"
## completeSampleCount "145"
## readPermission "TRUE"
## treatmentCount "448"
## structuralVariantCount "123"
## resourceCounts.resourceId "IDC_OHIF_V2"
## resourceCounts.displayName "CT Scan"
## resourceCounts.description "CT Scan"
## resourceCounts.resourceType "PATIENT"
## resourceCounts.priority "1"
## resourceCounts.openByDefault "TRUE"
## resourceCounts.sampleCount "592"
## resourceCounts.patientCount "585"
## resourceCounts.studyId "gbm_tcga_pan_can_atlas_2018"
## studyId "gbm_tcga_pan_can_atlas_2018"
## cancerTypeId "difg"
## cancerType.name "Diffuse Glioma"
## cancerType.dedicatedColor "Gray"
## cancerType.shortName "DIFG"
## cancerType.parent "brain"
## cancerType.cancerTypeId "difg"
## referenceGenome "hg19"
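Because t() turns the one-row table into a character matrix, every value printed above is a string, including the sample counts. Here is a minimal sketch of reading fields back as numbers. It uses a hand-made one-row table (toy_info, a hypothetical name) whose values are copied from the output above, not the real API result.

```r
# Hand-made stand-in for the one-row study table (values copied from the output)
toy_info <- data.frame(
  studyId              = "gbm_tcga_pan_can_atlas_2018",
  sequencedSampleCount = 397L,
  cnaSampleCount       = 575L,
  stringsAsFactors     = FALSE
)

# Transposing a table with mixed column types yields a character matrix
toy_matrix <- t(toy_info)

# Fields are addressed by row name; convert back to a number before computing
n_sequenced <- as.integer(toy_matrix["sequencedSampleCount", 1])
n_cna       <- as.integer(toy_matrix["cnaSampleCount", 1])

# Number of samples with copy-number data but no sequencing data, at most
print(n_cna - n_sequenced)
```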
To view the list of data available in this study:
study_profiles <- available_profiles(study_id) %>% as.data.frame()
study_profiles
## molecularAlterationType genericAssayType datatype
## 1 GENERIC_ASSAY ARMLEVEL_CNA CATEGORICAL
## 2 GENERIC_ASSAY GENETIC_ANCESTRY LIMIT-VALUE
## 3 COPY_NUMBER_ALTERATION <NA> DISCRETE
## 4 COPY_NUMBER_ALTERATION <NA> LOG2-VALUE
## 5 GENERIC_ASSAY METHYLATION LIMIT-VALUE
## 6 MUTATION_EXTENDED <NA> MAF
## 7 MRNA_EXPRESSION <NA> CONTINUOUS
## 8 MRNA_EXPRESSION <NA> Z-SCORE
## 9 MRNA_EXPRESSION <NA> Z-SCORE
## 10 PROTEIN_LEVEL <NA> LOG2-VALUE
## 11 PROTEIN_LEVEL <NA> Z-SCORE
## 12 STRUCTURAL_VARIANT <NA> SV
## name
## 1 Putative arm-level copy-number from GISTIC
## 2 Genetic Ancestry
## 3 Putative copy-number alterations from GISTIC
## 4 Log2 copy-number values
## 5 Methylation (HM27 and HM450 merge)
## 6 Mutations
## 7 mRNA Expression, RSEM (Batch normalized from Illumina HiSeq_RNASeqV2)
## 8 mRNA expression z-scores relative to diploid samples (RNA Seq V2 RSEM)
## 9 mRNA expression z-scores relative to all samples (log RNA Seq V2 RSEM)
## 10 Protein expression (RPPA)
## 11 Protein expression z-scores (RPPA)
## 12 Structural variants
## description
## 1 Putative arm-level copy-number from GISTIC 2.0.
## 2 Genetic ancestries were determined using five different methods as described in Carrot-Zhang et al (2020). These consensus calls were created based on the ancestral population that received the majority of assignments for each patient. The original data is <a href="https://gdc.cancer.gov/about-data/publications/CCG-AIM-2020">here</a>.
## 3 Putative copy-number from GISTIC 2.0. Values: -2 = homozygous deletion; -1 = hemizygous deletion; 0 = neutral / no change; 1 = gain; 2 = high level amplification.
## 4 Log2 copy-number values for each gene (from Affymetrix SNP6).
## 5 Methylation between-platform (hm27 and hm450) normalization values.
## 6 Mutation data from whole exome sequencing of 592 Glioblastoma samples.
## 7 mRNA Expression, RSEM (Batch normalized from Illumina HiSeq_RNASeqV2)
## 8 mRNA expression z-scores (RNA Seq V2 RSEM) compared to the expression distribution of each gene tumors that are diploid for this gene.
## 9 Log-transformed mRNA expression z-scores compared to the expression distribution of all samples (RNA Seq V2 RSEM).
## 10 Protein expression measured by reverse-phase protein array
## 11 Protein expression, measured by reverse-phase protein array, Z-scores
## 12 Structural Variant Data.
## showProfileInAnalysisTab patientLevel
## 1 TRUE FALSE
## 2 TRUE FALSE
## 3 TRUE FALSE
## 4 FALSE FALSE
## 5 TRUE FALSE
## 6 TRUE FALSE
## 7 FALSE FALSE
## 8 TRUE FALSE
## 9 TRUE FALSE
## 10 FALSE FALSE
## 11 TRUE FALSE
## 12 TRUE FALSE
## molecularProfileId
## 1 gbm_tcga_pan_can_atlas_2018_armlevel_cna
## 2 gbm_tcga_pan_can_atlas_2018_genetic_ancestry
## 3 gbm_tcga_pan_can_atlas_2018_gistic
## 4 gbm_tcga_pan_can_atlas_2018_log2CNA
## 5 gbm_tcga_pan_can_atlas_2018_methylation_hm27_hm450_merge
## 6 gbm_tcga_pan_can_atlas_2018_mutations
## 7 gbm_tcga_pan_can_atlas_2018_rna_seq_v2_mrna
## 8 gbm_tcga_pan_can_atlas_2018_rna_seq_v2_mrna_median_Zscores
## 9 gbm_tcga_pan_can_atlas_2018_rna_seq_v2_mrna_median_all_sample_Zscores
## 10 gbm_tcga_pan_can_atlas_2018_rppa
## 11 gbm_tcga_pan_can_atlas_2018_rppa_Zscores
## 12 gbm_tcga_pan_can_atlas_2018_structural_variants
## studyId sortOrder
## 1 gbm_tcga_pan_can_atlas_2018 <NA>
## 2 gbm_tcga_pan_can_atlas_2018 ASC
## 3 gbm_tcga_pan_can_atlas_2018 <NA>
## 4 gbm_tcga_pan_can_atlas_2018 <NA>
## 5 gbm_tcga_pan_can_atlas_2018 DESC
## 6 gbm_tcga_pan_can_atlas_2018 <NA>
## 7 gbm_tcga_pan_can_atlas_2018 <NA>
## 8 gbm_tcga_pan_can_atlas_2018 <NA>
## 9 gbm_tcga_pan_can_atlas_2018 <NA>
## 10 gbm_tcga_pan_can_atlas_2018 <NA>
## 11 gbm_tcga_pan_can_atlas_2018 <NA>
## 12 gbm_tcga_pan_can_atlas_2018 <NA>
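The molecularProfileId column identifies each data type in the study. As a sketch, the table can be filtered to list the identifiers of one kind of data. The rows below are copied by hand from the output above into a small table (toy_profiles, a hypothetical name) so that the example runs without a connection.

```r
# Three rows copied from the profile table printed above
toy_profiles <- data.frame(
  molecularAlterationType = c("MUTATION_EXTENDED",
                              "MRNA_EXPRESSION",
                              "MRNA_EXPRESSION"),
  datatype                = c("MAF", "CONTINUOUS", "Z-SCORE"),
  molecularProfileId      = c(
    "gbm_tcga_pan_can_atlas_2018_mutations",
    "gbm_tcga_pan_can_atlas_2018_rna_seq_v2_mrna",
    "gbm_tcga_pan_can_atlas_2018_rna_seq_v2_mrna_median_Zscores"
  ),
  stringsAsFactors = FALSE
)

# Keep only the mRNA expression profiles and extract their identifiers
mrna_profiles <- subset(toy_profiles, molecularAlterationType == "MRNA_EXPRESSION")
mrna_ids <- mrna_profiles$molecularProfileId
print(mrna_ids)
```

The same subset call should work on study_profiles itself, since it has the same column names.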
Download data
Now that we have chosen our study, we can download its data.
Get clinical data
#get the list of all available samples
sampleList <- available_samples(study_id = study_id)
#get clinical data from the list of available samples
clinical_data <- get_clinical_by_sample(sample_id = sampleList$sampleId,
study_id = study_id)
## ! No `clinical_attribute` passed. Defaulting to returning
## all clinical attributes in "gbm_tcga_pan_can_atlas_2018" study
#reshape to get one row per sample and one column per clinical attribute
clinical_data <- clinical_data %>% pivot_wider(names_from = "clinicalAttributeId") %>% as.data.frame()
#print first lines
head(clinical_data)
## uniqueSampleKey
## 1 VENHQS0wMi0yNDY2LTAxOmdibV90Y2dhX3Bhbl9jYW5fYXRsYXNfMjAxOA
## 2 VENHQS0wMi0yNDcwLTAxOmdibV90Y2dhX3Bhbl9jYW5fYXRsYXNfMjAxOA
## 3 VENHQS0wMi0yNDgzLTAxOmdibV90Y2dhX3Bhbl9jYW5fYXRsYXNfMjAxOA
## 4 VENHQS0wMi0yNDg1LTAxOmdibV90Y2dhX3Bhbl9jYW5fYXRsYXNfMjAxOA
## 5 VENHQS0wMi0yNDg2LTAxOmdibV90Y2dhX3Bhbl9jYW5fYXRsYXNfMjAxOA
## 6 VENHQS0wNi0xMDg0LTAxOmdibV90Y2dhX3Bhbl9jYW5fYXRsYXNfMjAxOA
## uniquePatientKey sampleId
## 1 VENHQS0wMi0yNDY2OmdibV90Y2dhX3Bhbl9jYW5fYXRsYXNfMjAxOA TCGA-02-2466-01
## 2 VENHQS0wMi0yNDcwOmdibV90Y2dhX3Bhbl9jYW5fYXRsYXNfMjAxOA TCGA-02-2470-01
## 3 VENHQS0wMi0yNDgzOmdibV90Y2dhX3Bhbl9jYW5fYXRsYXNfMjAxOA TCGA-02-2483-01
## 4 VENHQS0wMi0yNDg1OmdibV90Y2dhX3Bhbl9jYW5fYXRsYXNfMjAxOA TCGA-02-2485-01
## 5 VENHQS0wMi0yNDg2OmdibV90Y2dhX3Bhbl9jYW5fYXRsYXNfMjAxOA TCGA-02-2486-01
## 6 VENHQS0wNi0xMDg0OmdibV90Y2dhX3Bhbl9jYW5fYXRsYXNfMjAxOA TCGA-06-1084-01
## patientId studyId ANEUPLOIDY_SCORE CANCER_TYPE
## 1 TCGA-02-2466 gbm_tcga_pan_can_atlas_2018 11 Glioblastoma
## 2 TCGA-02-2470 gbm_tcga_pan_can_atlas_2018 5 Glioblastoma
## 3 TCGA-02-2483 gbm_tcga_pan_can_atlas_2018 4 Glioblastoma
## 4 TCGA-02-2485 gbm_tcga_pan_can_atlas_2018 8 Glioblastoma
## 5 TCGA-02-2486 gbm_tcga_pan_can_atlas_2018 8 Glioblastoma
## 6 TCGA-06-1084 gbm_tcga_pan_can_atlas_2018 7 Glioblastoma
## CANCER_TYPE_DETAILED FRACTION_GENOME_ALTERED MSI_SCORE_MANTIS
## 1 Glioblastoma Multiforme 0.3380 0.2855
## 2 Glioblastoma Multiforme 0.1140 0.2735
## 3 Glioblastoma Multiforme 0.2253 0.2721
## 4 Glioblastoma Multiforme 0.1883 0.2728
## 5 Glioblastoma Multiforme 0.2043 0.2683
## 6 Glioblastoma Multiforme 0.2901 0.2907
## MSI_SENSOR_SCORE MUTATION_COUNT ONCOTREE_CODE SAMPLE_TYPE SOMATIC_STATUS
## 1 0.86 99 GBM Primary Matched
## 2 0.02 50 GBM Primary Matched
## 3 0.3 45 GBM Primary Matched
## 4 0.15 54 GBM Primary Matched
## 5 0.04 57 GBM Primary Matched
## 6 0.3 90 GBM Primary Matched
## TBL_SCORE TISSUE_SOURCE_SITE TISSUE_SOURCE_SITE_CODE TMB_NONSYNONYMOUS
## 1 93 MD Anderson Cancer Center 2 3.366666667
## 2 31 MD Anderson Cancer Center 2 1.7
## 3 102 MD Anderson Cancer Center 2 1.5
## 4 33 MD Anderson Cancer Center 2 1.833333333
## 5 75 MD Anderson Cancer Center 2 1.9
## 6 83 Henry Ford Hospital 6 3
## TUMOR_TISSUE_SITE TUMOR_TYPE
## 1 Brain Glioblastoma Multiforme (GBM), Treated
## 2 Brain Glioblastoma Multiforme (GBM), Treated
## 3 Brain Glioblastoma Multiforme (GBM), Untreated
## 4 Brain Glioblastoma Multiforme (GBM), Untreated
## 5 Brain Glioblastoma Multiforme (GBM), Untreated
## 6 Brain Glioblastoma Multiforme (GBM), Untreated
## TISSUE_PROSPECTIVE_COLLECTION_INDICATOR
## 1 <NA>
## 2 <NA>
## 3 <NA>
## 4 <NA>
## 5 <NA>
## 6 <NA>
## TISSUE_RETROSPECTIVE_COLLECTION_INDICATOR
## 1 <NA>
## 2 <NA>
## 3 <NA>
## 4 <NA>
## 5 <NA>
## 6 <NA>
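The reshaping step can be tried on a small hand-made table before running it on real data. The sketch below assumes the long-format result has one row per sample and attribute, with the value stored as text in a column named value. The table toy_long and its sample ids are hypothetical. It also adds an optional step that is not in the workflow above: converting numeric-looking columns with base R type.convert, since all values are still strings after pivoting.

```r
library(tidyr)

# Hypothetical long-format table: one row per (sample, clinical attribute) pair
toy_long <- data.frame(
  sampleId            = c("S1", "S1", "S2", "S2"),
  clinicalAttributeId = c("MUTATION_COUNT", "SAMPLE_TYPE",
                          "MUTATION_COUNT", "SAMPLE_TYPE"),
  value               = c("99", "Primary", "50", "Primary"),
  stringsAsFactors    = FALSE
)

# One row per sample, one column per clinical attribute
toy_wide <- pivot_wider(toy_long,
                        names_from  = "clinicalAttributeId",
                        values_from = "value")
toy_wide <- as.data.frame(toy_wide)

# Values are still character strings; convert numeric-looking columns
toy_wide <- type.convert(toy_wide, as.is = TRUE)

str(toy_wide)
```

After the conversion, columns such as MUTATION_COUNT can be used directly in numeric summaries or plots, while text columns such as SAMPLE_TYPE stay as character.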
Citation
If you use cBioPortal in your research, don’t forget to cite it:
- Cerami et al. The cBio Cancer Genomics Portal: An Open Platform for Exploring Multidimensional Cancer Genomics Data. Cancer Discovery. May 2012 2; 401. PubMed. https://pubmed.ncbi.nlm.nih.gov/22588877/
- Gao et al. Integrative analysis of complex cancer genomics and clinical profiles using the cBioPortal. Sci. Signal. 6, pl1 (2013). PubMed. https://pubmed.ncbi.nlm.nih.gov/23550210/
- de Bruijn et al. Analysis and Visualization of Longitudinal Genomic and Clinical Data from the AACR Project GENIE Biopharma Collaborative in cBioPortal. Cancer Res (2023). PubMed. https://pubmed.ncbi.nlm.nih.gov/37668528/
Remember also to cite the source of the data if you are using a publicly available dataset.
Resources
For more information, please read:
cBioPortal FAQ: https://docs.cbioportal.org/user-guide/faq
cBioPortal tutorials: https://docs.cbioportal.org/user-guide/overview/
R scripts:
- https://www.karissawhiting.com/cbioportalR/articles/overview-of-workflow.html
- https://cran.r-project.org/web/packages/cbioportalR/readme/README.html